Dataframe Basics

We've learned about vectors and their two-dimensional counterpart, matrices. Now we will learn about data frames, one of the main tools for data analysis with R! Matrices were limited because all the data inside a matrix had to be of the same data type (numeric, logical, etc.). With data frames, each column can hold a different data type, which lets us organize mixed data in a single, very powerful structure!
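As a quick illustration of that difference, here is a minimal sketch that builds a matrix and a data frame from the same two vectors. The names nums, chars, m, and d are made up for this example.

```r
nums  <- c(1, 2, 3)
chars <- c("a", "b", "c")

# A matrix holds a single type, so the numbers are coerced to character
m <- cbind(nums, chars)
typeof(m)   # "character"

# A data frame keeps each column's own type
# (stringsAsFactors = FALSE keeps chars as character in every R version)
d <- data.frame(nums, chars, stringsAsFactors = FALSE)
sapply(d, class)
```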

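R actually ships with built-in datasets for quick reference to play around with! Check out the following examples. Note that only women is a true data frame; state.x77 and USPersonalExpenditure are stored as matrices, although they display in the same tabular form.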

In [2]:
# Data about US states (stored as a matrix)
state.x77
Out[2]:
               Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama              3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska                365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona              2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas             2110   3378        1.9    70.66   10.1    39.9    65  51945
California          21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado             2541   4884        0.7    72.06    6.8    63.9   166 103766
Connecticut          3100   5348        1.1    72.48    3.1    56.0   139   4862
Delaware              579   4809        0.9    70.06    6.2    54.6   103   1982
Florida              8277   4815        1.3    70.66   10.7    52.6    11  54090
Georgia              4931   4091        2.0    68.54   13.9    40.6    60  58073
Hawaii                868   4963        1.9    73.60    6.2    61.9     0   6425
Idaho                 813   4119        0.6    71.87    5.3    59.5   126  82677
Illinois            11197   5107        0.9    70.14   10.3    52.6   127  55748
Indiana              5313   4458        0.7    70.88    7.1    52.9   122  36097
Iowa                 2861   4628        0.5    72.56    2.3    59.0   140  55941
Kansas               2280   4669        0.6    72.58    4.5    59.9   114  81787
Kentucky             3387   3712        1.6    70.10   10.6    38.5    95  39650
Louisiana            3806   3545        2.8    68.76   13.2    42.2    12  44930
Maine                1058   3694        0.7    70.39    2.7    54.7   161  30920
Maryland             4122   5299        0.9    70.22    8.5    52.3   101   9891
Massachusetts        5814   4755        1.1    71.83    3.3    58.5   103   7826
Michigan             9111   4751        0.9    70.63   11.1    52.8   125  56817
Minnesota            3921   4675        0.6    72.96    2.3    57.6   160  79289
Mississippi          2341   3098        2.4    68.09   12.5    41.0    50  47296
Missouri             4767   4254        0.8    70.69    9.3    48.8   108  68995
Montana               746   4347        0.6    70.56    5.0    59.2   155 145587
Nebraska             1544   4508        0.6    72.60    2.9    59.3   139  76483
Nevada                590   5149        0.5    69.03   11.5    65.2   188 109889
New Hampshire         812   4281        0.7    71.23    3.3    57.6   174   9027
New Jersey           7333   5237        1.1    70.93    5.2    52.5   115   7521
New Mexico           1144   3601        2.2    70.32    9.7    55.2   120 121412
New York            18076   4903        1.4    70.55   10.9    52.7    82  47831
North Carolina       5441   3875        1.8    69.21   11.1    38.5    80  48798
North Dakota          637   5087        0.8    72.78    1.4    50.3   186  69273
Ohio                10735   4561        0.8    70.82    7.4    53.2   124  40975
Oklahoma             2715   3983        1.1    71.42    6.4    51.6    82  68782
Oregon               2284   4660        0.6    72.13    4.2    60.0    44  96184
Pennsylvania        11860   4449        1.0    70.43    6.1    50.2   126  44966
Rhode Island          931   4558        1.3    71.90    2.4    46.4   127   1049
South Carolina       2816   3635        2.3    67.96   11.6    37.8    65  30225
South Dakota          681   4167        0.5    72.08    1.7    53.3   172  75955
Tennessee            4173   3821        1.7    70.11   11.0    41.8    70  41328
Texas               12237   4188        2.2    70.90   12.2    47.4    35 262134
Utah                 1203   4022        0.6    72.90    4.5    67.3   137  82096
Vermont               472   3907        0.6    71.64    5.5    57.1   168   9267
Virginia             4981   4701        1.4    70.08    9.5    47.8    85  39780
Washington           3559   4864        0.6    71.72    4.3    63.5    32  66570
West Virginia        1799   3617        1.4    69.48    6.7    41.6   100  24070
Wisconsin            4589   4468        0.7    72.48    3.0    54.5   149  54464
Wyoming               376   4566        0.6    70.29    6.9    62.9   173  97203
In [8]:
# US personal expenditure
USPersonalExpenditure
Out[8]:
                      1940   1945  1950 1955  1960
Food and Tobacco    22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health   3.530  5.760  9.71 14.0 21.10
Personal Care        1.040  1.980  2.45  3.4  5.40
Private Education    0.341  0.974  1.80  2.6  3.64
In [4]:
# Average heights and weights of American women
women
Out[4]:
   height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164

To get a list of all available built-in datasets (not only data frames), use data():

In [7]:
data()
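Calling data() on its own opens a listing in a separate viewer. If you would rather work with that listing in code, here is a small sketch that restricts it to the datasets package and reads the $results matrix that data() returns. The variable name ds is just for illustration.

```r
# data() returns an object whose $results matrix has one row per dataset
ds <- data(package = "datasets")

# Show the name and description of the first few entries
head(ds$results[, c("Item", "Title")])

# Check whether a particular dataset is listed
"women" %in% ds$results[, "Item"]
```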

Working with DataFrames

You'll notice the states data was really big. We can use the head() and tail() functions to view the first and last 6 rows, respectively (6 is the default). Let's take a look:

In [9]:
# Quick variable assignment to save typing
states <- state.x77
In [11]:
head(states)
Out[11]:
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
In [12]:
tail(states)
Out[12]:
              Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
Vermont              472   3907        0.6    71.64    5.5    57.1   168  9267
Virginia            4981   4701        1.4    70.08    9.5    47.8    85 39780
Washington          3559   4864        0.6    71.72    4.3    63.5    32 66570
West Virginia       1799   3617        1.4    69.48    6.7    41.6   100 24070
Wisconsin           4589   4468        0.7    72.48    3.0    54.5   149 54464
Wyoming              376   4566        0.6    70.29    6.9    62.9   173 97203
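Both functions also accept an n argument if you want something other than the default 6 rows. A short sketch, using the same states variable as above:

```r
states <- state.x77

head(states, n = 3)    # first 3 rows
tail(states, n = 2)    # last 2 rows

# A negative n drops rows instead: all rows except the last 45
head(states, n = -45)
```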

DataFrames - Overview of information

We can use str() to inspect the structure of an object, such as its dimensions, variable names, and data types. We can use summary() to get a quick statistical summary of every column. Depending on the data, this may or may not be useful!

In [13]:
# Statistical summary of data
summary(states)
Out[13]:
   Population        Income       Illiteracy       Life Exp    
 Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96  
 1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12  
 Median : 2838   Median :4519   Median :0.950   Median :70.67  
 Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88  
 3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89  
 Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60  
     Murder          HS Grad          Frost             Area       
 Min.   : 1.400   Min.   :37.80   Min.   :  0.00   Min.   :  1049  
 1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985  
 Median : 6.850   Median :53.25   Median :114.50   Median : 54277  
 Mean   : 7.378   Mean   :53.11   Mean   :104.46   Mean   : 70736  
 3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81162  
 Max.   :15.100   Max.   :67.30   Max.   :188.00   Max.   :566432  
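summary() also works on a single column, and the individual statistics are available through their own functions. The sketch below uses the matrix-style [ , "name"] indexing, since states is stored as a matrix:

```r
states <- state.x77

# Summary of one column, selected by name
summary(states[, "Income"])

# Individual statistics for a column
mean(states[, "Income"])
range(states[, "Murder"])   # smallest and largest value
```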
In [14]:
# Structure of Data
str(states)
 num [1:50, 1:8] 3615 365 2212 2110 21198 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
  ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
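The str() output above starts with num [1:50, 1:8], which tells us that states is a 50 x 8 numeric matrix rather than a data frame. If you want a real data frame, as.data.frame() will convert it. A minimal sketch (states_df is a name chosen for this example):

```r
states <- state.x77

is.matrix(states)          # TRUE
is.data.frame(states)      # FALSE

# Convert the matrix to a data frame; each column becomes a variable
states_df <- as.data.frame(states)
is.data.frame(states_df)   # TRUE

# str() now reports a 'data.frame' with one entry per column
str(states_df)
```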

Creating Data frames

A quick note: some people write "dataframe" as one word, but in R it's more commonly written as two words: data frame. It's not a big deal either way, but if someone writes "DataFrame" they may be referring to a Python/pandas DataFrame, so keep that in mind!

We can create data frames with the data.frame() function by passing vectors as arguments; each vector becomes a column of the data frame. Let's see a simple example:

In [16]:
# Some made up weather data
days <- c('mon','tue','wed','thu','fri')
temp <- c(22.2,21,23,24.3,25)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
In [18]:
# Pass in the vectors:
df <- data.frame(days,temp,rain)
In [19]:
df
Out[19]:
  days temp  rain
1  mon 22.2  TRUE
2  tue 21.0  TRUE
3  wed 23.0 FALSE
4  thu 24.3 FALSE
5  fri 25.0  TRUE
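By default the column names come from the variable names. You can also name the columns explicitly with name = vector arguments, and check the result with names() and dim(). In the sketch below, df2 and the column names day, temperature, and rained are made up for illustration:

```r
days <- c('mon','tue','wed','thu','fri')
temp <- c(22.2,21,23,24.3,25)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

# Name each column explicitly
df2 <- data.frame(day = days, temperature = temp, rained = rain)

names(df2)   # "day" "temperature" "rained"
dim(df2)     # 5 rows, 3 columns
```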
In [20]:
# Check structure
str(df)
'data.frame':	5 obs. of  3 variables:
 $ days: Factor w/ 5 levels "fri","mon","thu",..: 2 4 5 3 1
 $ temp: num  22.2 21 23 24.3 25
 $ rain: logi  TRUE TRUE FALSE FALSE TRUE
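Note that the Factor type shown for days comes from R versions before 4.0.0, where data.frame() converted character vectors to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the same call reports days as chr. Setting the argument explicitly gives the same result in any version. A small sketch (df_chr and df_fct are names chosen for this example):

```r
days <- c('mon','tue','wed','thu','fri')
temp <- c(22.2,21,23,24.3,25)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

# Keep days as a character column
df_chr <- data.frame(days, temp, rain, stringsAsFactors = FALSE)
str(df_chr)

# Convert days to a factor, as in the output above
df_fct <- data.frame(days, temp, rain, stringsAsFactors = TRUE)
levels(df_fct$days)   # levels are sorted alphabetically
```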
In [21]:
summary(df)
Out[21]:
  days        temp         rain        
 fri:1   Min.   :21.0   Mode :logical  
 mon:1   1st Qu.:22.2   FALSE:2        
 thu:1   Median :23.0   TRUE :3        
 tue:1   Mean   :23.1   NA's :0        
 wed:1   3rd Qu.:24.3                  
         Max.   :25.0                  

That's it for the basics. Up next, we will learn about selecting and indexing data frame elements!